Path: ...!eternal-september.org!feeder.eternal-september.org!mx02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Interpreter loops vs branch predictors
Date: Mon, 07 Sep 2015 12:25:07 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 187
Message-ID: <2015Sep7.142507@mips.complang.tuwien.ac.at>
References: <2015Aug10.194919@mips.complang.tuwien.ac.at> <2015Aug11.090812@mips.complang.tuwien.ac.at> <4f9355a3-4f22-4428-9ef7-c5fb3e54a1fd@googlegroups.com>
Injection-Info: mx02.eternal-september.org; posting-host="d47d3421039fe8026514328ad0ebacae"; logging-data="30576"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19B6PxvVzfuGwTkieGCHRSd"
X-newsreader: xrn 10.00-beta-3
Cancel-Lock: sha1://EMmKcrcKugh4t9QqfnfOBaWTY=
Bytes: 9775

Bruce Hoult writes:
>Do you not have a Haswell machine available? I'm sure someone will have if
>you have test binaries available.

I now have access to a Core i7-4790K (Haswell); here are the results
for the same things I have done before on the Ivy Bridge; Turbo Boost
was disabled on the Haswell (i.e., it ran at 4GHz).

Run time for individual benchmarks in seconds of user time; branch
misses reported for running them all:
for j in ' --ss-number=0 --ss-states=0 --no-dynamic' \
         ' --ss-number=0 --ss-states=0 --no-super' \
         ' --ss-number=0 --ss-states=0'; do
  for k in gforth-fast-4.8-PR15242 gforth-fast-4.8; do
    i=$k$j
    echo $i onebench.fs
    perf stat -x " " -e branch-misses $i onebench.fs
  done
done

gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
sieve bubble matrix   fib   fft
0.264  0.196  0.132 0.228 0.088
13039125 branch-misses

gforth-fast-4.8 --ss-number=0 --ss-states=0 --no-dynamic onebench.fs
sieve bubble matrix   fib   fft
0.112  0.152  0.092 0.164 0.060
11273919 branch-misses

gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 --no-super onebench.fs
sieve bubble matrix   fib   fft
0.120  0.160  0.088 0.156 0.052
11457651 branch-misses

gforth-fast-4.8 --ss-number=0 --ss-states=0 --no-super onebench.fs
sieve bubble matrix   fib   fft
0.112  0.152  0.088 0.152 0.056
11107473 branch-misses

gforth-fast-4.8-PR15242 --ss-number=0 --ss-states=0 onebench.fs
sieve bubble matrix   fib   fft
0.096  0.128  0.052 0.100 0.032
8690210 branch-misses

gforth-fast-4.8 --ss-number=0 --ss-states=0 onebench.fs
sieve bubble matrix   fib   fft
0.088  0.120  0.052 0.088 0.036
7687416 branch-misses

gforth-fast-4.8-PR15242 is built with gcc-4.8, with one shared indirect
jump (that is jumped-to with direct jumps, similar to what switch-based
interpreters usually generate), while gforth-fast-4.8 has a separate
indirect jump at the end of each VM instruction implementation;
"--no-dynamic" (the first two runs) uses that code directly.
"--no-super" uses dynamic replication (and gives each replicated
implementation its own indirect branch); and the last two runs use
dynamic superinstructions with replication (also with a separate
indirect branch at the end of each dynamic superinstruction).  See
<2015Aug10.194919@mips.complang.tuwien.ac.at> for a more detailed
description.
The branch mispredictions are indeed not that much different between
the versions (all better than the best result on the Ivy Bridge), but
there is still quite a bit of performance difference between the first
two (just from the extra direct jump); dynamic replication does not
give better branch prediction on the Haswell, and consequently no
better performance than gforth-fast-4.8 --no-dynamic.  But
superinstructions with replication do provide a nice speedup again,
probably due to fewer executed branches.

So, if you conclude from the paper that optimizing the dispatch of
interpreters is no longer relevant, that's not true even on Haswell
(and certainly not on any other CPU).  On most benchmarks the last
version is more than twice as fast as the first one, and the first one
is probably faster than a switch-based interpreter.

But the paper is correct that the branch prediction of the Haswell is
much better than that of earlier CPUs.  The difference is most extreme
for gforth-fast-4.8-PR15242 --no-dynamic: 71M mispredictions on Ivy
Bridge, 13M on Haswell (Ivy Bridge results:
<2015Aug11.090812@mips.complang.tuwien.ac.at>).

Let's look at more substantial benchmarks, from the Forth appbench
suite (Ivy Bridge results in the same posting
<2015Aug11.090812@mips.complang.tuwien.ac.at>).

Run time in seconds:

 shared   non-shared
 --no-    --no-    --no-
 dynamic  dynamic  super   default
  3.332    2.440   2.276    1.468  benchgc
  1.652    1.524   1.064    0.544  brainless
  4.016    3.416   2.288    1.520  brew
  3.420    3.236   2.232    1.232  cd16sim
  2.956    2.684   1.484    0.864  fcp
 13.128    9.848   9.280    7.840  lexex

Again, speedup factors 1.67-3.42 from interpreter optimizations from
the most switch-like version to the Gforth default, and also good
speedups for separate indirect jumps over the shared indirect jump.
For mispredictions the picture was as follows:

      shared       non-shared
      --no-        --no-        --no-
      dynamic      dynamic      super       default
  24,596,739   17,005,970   12,584,698    9,426,835  benchgc
 163,953,064  178,234,917   51,375,412   18,391,920  brainless
 281,856,378  279,851,195   47,928,843   21,390,209  brew
 268,684,916  297,000,355   70,879,605   22,105,620  cd16sim
 255,802,864  293,473,631   33,496,775   17,280,944  fcp
  39,426,063   12,267,065    8,269,104    6,885,712  lexex

We don't see a consistent benefit from the separate indirect jumps
here; replication does help a lot for a number of benchmarks, though,
and dynamic superinstructions with replication even more.  These
differences in branch prediction can explain the differences in
run time between the second and the third column.  So, even with
Haswell's branch predictor, there are significant branch prediction
benefits from replication, resulting in significant speedups.  I'll
leave the explanation of these results to the branch predictor
experts.

The difference from Ivy Bridge
<2015Aug11.090812@mips.complang.tuwien.ac.at> is quite big, especially
for the first lexex column: 2590M mispredictions on Ivy Bridge, 39M on
Haswell.

Finally, the results from my interpreter dispatch microbenchmark
(cf. <2015Aug11.160837@mips.complang.tuwien.ac.at> for Ivy Bridge
results):

subrout direct indirect switch call repl-sw
   0.00   0.02     0.02   0.14 0.15    0.11

Branch mispredictions:

for i in subroutine direct indirect switch call repl-switch; do
  perf stat -x " " -e branch-misses ./$i |& awk '{printf("%10d '$i'\n",$1)}'
done

      3835 subroutine
      3302 direct
      3863 indirect
      9083 switch
      4419 call
      6018 repl-switch

Close to perfect branch prediction for all variants (there are 100M
indirect branches/calls in each benchmark except subroutine).  There is
still a speed difference, probably due to the additional overhead of
switch, call, and repl-switch: direct and indirect use 1 cycle/VM
instruction, switch 5.8, call 6.1, and repl-switch 4.4.
It's interesting that, even without the branch prediction advantage
that was the motivation for repl-switch, repl-switch is still quite a
bit faster than switch (at least for this benchmark; for bigger
programs the larger cache load from the replicated switch tables could
become a problem).

My conclusion: Haswell's indirect branch prediction is really much
better than before, but for larger programs running on an interpreter
like Gforth, replication still provides substantial branch prediction
improvements that result in significant speedups.  And while there is
no longer a branch prediction advantage to keeping separate indirect
branches for dispatch, there is still a significant speed advantage;
and dynamic superinstructions also provide a good speedup, resulting
in overall speedups by factors 1.67-3.42 for the application
benchmarks.

Why are the results here different from those in the paper?
1) A different interpreter; 2) different benchmarks.

If you write an interpreter and look for performance, should you go
for interpreter optimizations like threaded code, replication, and
dynamic superinstructions, as I suggest, or just use a switch-based
interpreter, as the paper suggests?  Threaded code is a good idea with
little cost in any case.  If that provides a significant speedup and
your VM instruction implementations are short (you run into cache
trouble with long VM instructions [vitale&abdelrahman04]), then
replication with superinstructions will probably give a good speedup.

@InProceedings{vitale&abdelrahman04,
  author    = {Benjamin Vitale and Tarek S. Abdelrahman},
  title     = {Catenation and Specialization for {Tcl} Virtual Machine
               Performance},
  crossref  = {ivme04},
  pages     = {42--50},
  year      = {2004},
  OPTannote = {}
}

@Proceedings{ivme04,
  booktitle = {IVME '04 Proceedings},
  title     = {IVME '04 Proceedings},
  year      = {2004},
  OPTeditor = {}
}

- anton
--
M. Anton Ertl                    Some things have to be seen to be believed

========== REMAINDER OF ARTICLE TRUNCATED ==========